bloom filter bulk inserts and queries #723
base: main
Conversation
Some initial comments. I have yet to look at the bulk insert code
LGTM! If we fix the test failure that I noted in my previous review, then we're good to go
CI failure due to flaky test fixed in #724.
Force-pushed from 09afe55 to 49f8071
LGTM with the new bulk query changes
loop2_prefetch :: KeyIx -> ST s ()
loop2_prefetch !kix
  | kix == ksN  = pure ()
  | otherwise   = do
      -- prefetch this key's cache line in the filter, then move on;
      -- the real lookups happen in a second loop over the keys
      let !keyhash = P.indexPrimArray keyhashes kix
      Bloom.prefetchElem filter keyhash
      loop2_prefetch (kix+1)
The commit message for 8eaf7ce says that the prefetch distance is the number of runs, but this looks like the prefetch distance is the number of keys
Yes, oops. I kept changing my mind. I did both versions and benchmarked them.
Updated the commit message.
So in the benchmarks I did, nesting the loops this way round was slightly faster. But arguably, the other way round should have more stable prefetching behaviour, since it would depend on the number of runs and not the lookup batch size.
Switching it round is not that hard. I could still do that. Thoughts?
I've switched it round.
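For concreteness, here is a minimal sketch of the switched nesting, with hypothetical names (`queryKeysOuter`, `prefetch`, `lookupOne` are illustrations, not the PR's actual code): keys on the outside, filters on the inside, so each key is prefetched in every filter before being looked up in every filter.

```haskell
-- Keys-outer nesting: the prefetch distance is the number of runs,
-- independent of the lookup batch size.
queryKeysOuter :: Monad m
               => (f -> h -> m ())    -- per-filter prefetch (assumed)
               -> (f -> h -> m Bool)  -- per-filter lookup (assumed)
               -> [f] -> [h] -> m [[Bool]]
queryKeysOuter prefetch lookupOne filters keyhashes =
    mapM queryKey keyhashes
  where
    queryKey h = do
      mapM_ (\f -> prefetch  f h) filters   -- prefetch pass: distance = #runs
      mapM  (\f -> lookupOne f h) filters   -- lookup pass
```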
Instead of inserting keys into the bloom filter one by one as they are added to the run accumulator, save them up and add them all in one go when the page is being finalised. This lets us use a bloom filter bulk insert, which in turn lets us use memory prefetching. The result should be faster.
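A minimal sketch of that shape, with hypothetical names throughout (`PageAcc`, `addKey`, `insertMany` are illustrations, not the library's API): buffer the key hashes as keys arrive, and hand them to a bulk insert only at page finalisation, where the bulk insert can prefetch cache lines for upcoming keys while setting the bits for the current one.

```haskell
import Control.Monad.ST (ST)
import Data.STRef (STRef, modifySTRef', readSTRef, writeSTRef)
import Data.Word (Word64)

data PageAcc s bloom = PageAcc
  { accKeys   :: STRef s [Word64]  -- key hashes saved up for this page
  , accFilter :: bloom             -- the run's mutable bloom filter
  }

addKey :: PageAcc s bloom -> Word64 -> ST s ()
addKey acc h = modifySTRef' (accKeys acc) (h :)

-- 'insertMany' stands in for a real bulk-insert operation
finalisePage :: (bloom -> [Word64] -> ST s ()) -> PageAcc s bloom -> ST s ()
finalisePage insertMany acc = do
    hs <- readSTRef (accKeys acc)
    insertMany (accFilter acc) hs
    writeSTRef (accKeys acc) []
```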
Fetch into the caches in the least intrusive way, "0" levels, rather than "3" levels. This does not appear to slow down inserts, and should evict fewer things from the caches. And document what level "0" means and why we use it.
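For reference, the levels here are GHC's prefetch primops (this sketch shows the general mechanism, not necessarily the PR's exact code). The numeric suffix is a temporal-locality hint: 3 asks for the highest locality, 0 for the lowest, so a level-0 prefetch is the least intrusive and should evict less existing cache content.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (ByteArray#, Int#, prefetchByteArray0#, prefetchByteArray3#)
import GHC.ST (ST (ST))

-- wrap the raw primops as ST actions; the Int# is a byte offset
prefetchLow, prefetchHigh :: ByteArray# -> Int# -> ST s ()
prefetchLow  ba i = ST (\s -> (# prefetchByteArray0# ba i s, () #))
prefetchHigh ba i = ST (\s -> (# prefetchByteArray3# ba i s, () #))
```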
`bloomQueriesModel` was not really a proper model, because the model itself was using actual bloom filters. The model is now instead a `Set`, and `prop_bloomQueriesModel` is updated because the model will now only return true positives and negatives.
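A sketch of what the Set-based model looks like (hypothetical signature; the real code uses the library's key and index types). Because a Set answers membership exactly, the model yields only true positives and true negatives, so the property can only require that every model hit is also a real bloom filter hit; the real filter may add false positives.

```haskell
import qualified Data.Set as Set

-- report (filter index, key index) for every exact-membership hit
bloomQueriesModel :: Ord k => [Set.Set k] -> [k] -> [(Int, Int)]
bloomQueriesModel filters keys =
    [ (fix, kix)
    | (fix, filterModel) <- zip [0 ..] filters
    , (kix, key)         <- zip [0 ..] keys
    , key `Set.member` filterModel
    ]
```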
Force-pushed from 49f8071 to 6dac4ec
Previously, for the Classic Bloom filter implementation we had two different implementations of bloomQueries: one that was relatively simple and didn't rely on anything fancy, and one that went all out to maximise performance. The high performance one had to be disabled when we added the block-structured bloom filter implementation, since it was tightly coupled to the classic implementation.

With the new bloom filter API and implementation, we can now implement a single high performance version of bulk query. We no longer need separate higher and lower performance versions, since we no longer need to rely on fancy features like unlifted boxed arrays. So strip out the bloom-query-fast cabal flag and the BloomFilterQuery2 module.

The updated BloomFilterQuery1 does a simple nested loop: iterating over all filters, and within that over all keys. It does two loops over the keys: one to prefetch the key for the filter, and a second to do the lookups for real. Thus the prefetch distance is the number of keys, which is of course somewhat variable. In benchmarks this version was slightly faster than nesting the loops the other way around, but perhaps the other way around would have more stable prefetching behaviour over a wider range of key batch sizes. Changing this would not be too hard.
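A sketch of the two-pass shape described above, for one filter (hypothetical names; `prefetch` and `lookupOne` are assumptions standing in for the real API): first sweep the key hashes issuing prefetches, then sweep them again doing the real lookups, by which time the needed cache lines should already be in flight.

```haskell
import Control.Monad.ST (ST)
import Data.Word (Word64)
import qualified Data.Primitive.PrimArray as P

queryOneFilter :: (f -> Word64 -> ST s ())    -- per-element prefetch (assumed)
               -> (f -> Word64 -> ST s Bool)  -- per-element lookup (assumed)
               -> f -> P.PrimArray Word64 -> ST s [Bool]
queryOneFilter prefetch lookupOne f keyhashes = do
    let n = P.sizeofPrimArray keyhashes
    -- pass 1: prefetch every key's cache line in this filter
    sequence_ [ prefetch f (P.indexPrimArray keyhashes i) | i <- [0 .. n-1] ]
    -- pass 2: the real lookups
    mapM (lookupOne f . P.indexPrimArray keyhashes) [0 .. n-1]
```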
The query module used to be big, and there used to be two of them. Now there's only one and it's a lot smaller. So it makes sense to keep it all together in one module.
We were using a mix of `import qualified ... as BF` and `import qualified ... as Bloom`.
Force-pushed from 6dac4ec to ef64262
This means our prefetching is done with a prefetch distance of the number of runs, which is more stable than the key batch size. Also tweak the code so we get only joinrec and not letrec in the core.
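On the joinrec point, an illustrative sketch (not the PR's code): a local loop whose recursive calls are all saturated tail calls compiles to a Core joinrec, i.e. a label plus jumps with no closure allocation, whereas a loop that is returned or stored has to stay a letrec with an allocated closure.

```haskell
{-# LANGUAGE BangPatterns #-}
sumTo :: Int -> Int
sumTo n0 = go n0 0
  where
    go :: Int -> Int -> Int
    go 0 !acc = acc
    go n !acc = go (n - 1) (acc + n)  -- tail call only: becomes a join point
```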
Force-pushed from ef64262 to a02e702
Two changes, both based on bulk operations for bloom filters.
For Bloom filter inserts in run accumulation: instead of inserting keys into the bloom filter one by one as they are added to the run accumulator, save them up and add them all in one go when the page is being finalised. This then lets us use a bloom filter bulk insert, which lets us use memory prefetching.
For Bloom filter queries in key lookups, update the existing bulk query code to properly take advantage of the new API and use prefetching. We can now also simplify and use a single high performance implementation, rather than needing two (a more compatible one and a faster one that relied on fancier features available in later GHC versions).
Results for the WP8 benchmark (100M elements, 10 bits per key, full caching, 10k batches of 256 keys):
So overall about a 16% improvement in ops/sec on the primary WP8 benchmark, and as a bonus, getting over the magic 100k ops/sec threshold (on my laptop).